Last week we kicked off our Meet Cassandra series – with a short introduction to meet and greet. Today, we’ll begin getting to know her by immediately get hands on with data modeling. Download Cassandra 1.2.2 and follow the below screencast to get it running.
Data modeling with Cassandra (C*) is dramatically different than with RDBMS. With RDBMS, a normalized model is created without considering the exact queries it will service, as SQL can return just about anything through joins. With C*, the data model is designed for the specific queries it will service; as new queries are introduced, the schema is adjusted accordingly. C* has no concept of joins, relationships, or foreign keys. Instead, a separate table is typically leveraged per query, and data required by multiple tables is denormalized across those tables.
Let’s consider a simple “Twitter-clone” application, focusing on two queries:
- What information does a user have in his/her profile?
- What tweets has a user posted?
Let’s make a model for these queries with Cassandra Query Language (CQL) from the cqlsh prompt where our screencast ended. While CQL mimics SQL, don’t let it lure you into thinking it has all the querying power of SQL – it truly contains just a subset. We need a Keyspace to house our tables. Create one with:
CREATE KEYSPACE demo WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
USE demo;
Don’t worry about the specifics of the above commands for now. Next, let’s create our tables:
CREATE TABLE users (
email varchar,
bio varchar,
birthday timestamp,
active boolean,
PRIMARY KEY (email)
);
CREATE TABLE tweets (
email varchar,
time_posted timestamp,
tweet varchar,
PRIMARY KEY (email, time_posted)
);
Our users table is keyed by user email, and has columns for bio, birthday, etc. We have specified a variety of data types for columns, (varchar, timestamp, …); these are but a subset all the supported types. The tweets table is keyed by both the email of the user who posted the tweet and the time the tweet was posted, and contains a column for the tweet message itself. Now let’s insert some data into the users table and read it back. Inserts are similar to SQL:
INSERT INTO users (email, bio, birthday, active) VALUES
('john.doe@bti360.com', 'BTI360 Teammate', 516513600000, true);
In case you’re wondering, timestamp fields are specified as milliseconds since epoch, hence the big numbers. Exit cqlsh with quit;
then add in some sample data from this file with cqlsh --file <file>
. You might see some errors regarding keyspaces/tables already created, ignore them. Re-enter cqlsh, and run SELECT * FROM users;
and SELECT * FROM tweets;
We should see something similar to:
USERS
active | bio | birthday | |
new.user@gmail.com | True | null | null |
john.doe@bti360.com | True | BTI360 Teammate | 1986-05-15 00:00:00-0400 |
bob@yahoo.com | False | Hi. | null |
TWEETS
time_posted | tweet | |
new.user@gmail.com | 2013-02-09 13:38:55-0500 | What’s a tweet? |
john.doe@bti360.com | 2013-02-09 13:30:01-0500 | VT FTW |
john.doe@bti360.com | 2013-02-09 13:43:48-0500 | Go orange. |
Our select statement for users show nulls where values were omitted for fields, and tweets shows new rows for each tweet for a user. In reality, cqlsh is lying to us. It is presenting the data in a Relational-esque view. Those nulls don’t actually exist, and neither do the extra rows.
Check out our next blog entry when we go behind the scenes and dive deeper into data modeling with Cassandra to see what sets her apart!